Discovering Descriptive Tile Trees - By Mining Optimal Geometric Subtiles

Authors

  • Nikolaj Tatti
  • Jilles Vreeken
Abstract

When analysing binary data, the ease with which one can interpret results is very important. Many existing methods, however, either discover models that are difficult to read or return so many results that interpretation becomes impossible. Here, we study a fully automated approach for mining easily interpretable models for binary data. We model data hierarchically with noisy tiles: rectangles whose density differs significantly from that of their parent tile. To identify good trees, we employ the Minimum Description Length (MDL) principle. We propose Stijl, a greedy any-time algorithm for mining good tile trees from binary data. Iteratively, it finds the locally optimal addition to the current tree, allowing overlap with tiles of the same parent. A major result of this paper is that we find the optimal tile in only Θ(NM min(N,M)) time. Stijl can either be employed as a top-k miner, or we can use MDL to identify the tree that describes the data best. Experiments show that we find succinct models that accurately summarise the data and that, by their hierarchical property, are easily interpretable.
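
To make the notion of a tile with "significantly different density than its parent" concrete, here is a minimal Python sketch. It is not Stijl and not the paper's Θ(NM min(N,M)) search: it simply tries every geometric (contiguous) rectangle in a small binary matrix and scores it by how many bits are saved when the tile and the rest of the parent are each encoded with their own density. The function names and the score are assumptions made for illustration; a real MDL score would also charge for describing the tile itself.

```python
import math
from itertools import combinations

def prefix_sums(data):
    """2D prefix sums so the number of 1s in any rectangle is an O(1) lookup."""
    n, m = len(data), len(data[0])
    ps = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            ps[i + 1][j + 1] = data[i][j] + ps[i][j + 1] + ps[i + 1][j] - ps[i][j]
    return ps

def ones_in(ps, r0, r1, c0, c1):
    """Number of 1s in the half-open rectangle rows [r0, r1), columns [c0, c1)."""
    return ps[r1][c1] - ps[r0][c1] - ps[r1][c0] + ps[r0][c0]

def bernoulli_cost(ones, cells):
    """Bits needed to encode `cells` binary values at their own observed density."""
    if cells == 0 or ones == 0 or ones == cells:
        return 0.0
    p = ones / cells
    return -(ones * math.log2(p) + (cells - ones) * math.log2(1 - p))

def best_subtile(data):
    """Brute-force search (illustrative only) for the rectangle with the largest gain.

    Gain = cost of the parent at one density
           minus cost of the tile and of the remainder at separate densities.
    Model cost is ignored here, which a full MDL score would not do.
    """
    n, m = len(data), len(data[0])
    ps = prefix_sums(data)
    total_ones = ones_in(ps, 0, n, 0, m)
    base = bernoulli_cost(total_ones, n * m)
    best_gain, best_tile = 0.0, None
    for r0, r1 in combinations(range(n + 1), 2):
        for c0, c1 in combinations(range(m + 1), 2):
            inside = ones_in(ps, r0, r1, c0, c1)
            cells = (r1 - r0) * (c1 - c0)
            split = (bernoulli_cost(inside, cells)
                     + bernoulli_cost(total_ones - inside, n * m - cells))
            if base - split > best_gain:
                best_gain, best_tile = base - split, (r0, r1, c0, c1)
    return best_gain, best_tile

data = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
gain, tile = best_subtile(data)
print(tile)  # (1, 3, 1, 4): rows 1-2, columns 1-3
```

Applying the same search recursively inside a chosen tile would give a hierarchy of tiles in the spirit of the abstract, although the brute-force enumeration here takes time proportional to N²M² and is only practical for small matrices.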

Similar resources

FRECLE Mining: Discovering Frequent Semantic Tree Cluster Sequences from Historical Tree Structured Data

Mining frequent trees is very useful in domains like bioinformatics, web mining, mining semistructured data, and so on. Existing techniques focus on finding “structural” patterns and ignore the “semantics” that may be associated with the subtrees. In this paper we propose an algorithm to mine a novel pattern called frequent semantic tree cluster sequences (FRECLE), which captures the frequent...

Mining of Users’ Access Behaviour for Frequent Sequential Pattern from Web Logs

Sequential Pattern mining is the process of applying data mining techniques to a sequential database for the purposes of discovering the correlation relationships that exist among an ordered list of events. The task of discovering frequent sequences is challenging, because the algorithm needs to process a combinatorially explosive number of possible sequences. Discovering hidden information fro...

Discovering Frequent Embedded Subtree Patterns from Large Databases of Unordered Labeled Trees

Recent years have witnessed a surge of research interest in knowledge discovery from data domains with complex structures, such as trees and graphs. In this paper, we address the problem of mining maximal frequent embedded subtrees, which is motivated by such important applications as mining “hot” spots of Web sites from Web usage logs and discovering significant “deep” structures from tree-like...

Efficient Substructure Discovery from Large Semi-structured Data

With the rapid progress of network and storage technologies, a huge amount of electronic data, such as Web pages and XML data [23], has become available on intranets and the Internet. These electronic data are heterogeneous collections of ill-structured data that have no rigid structure, and are often called semi-structured data [1]. Hence, there have been increasing demands for automatic methods for extracting usef...

Efficient Tree Mining Using Reverse Search

In this paper, we review our data mining algorithms for discovering frequent substructures in a large collection of semi-structured data, where both the patterns and the data are modeled by labeled trees. These algorithms, namely FREQT for mining frequent ordered trees and UNOT for mining frequent unordered trees, efficiently enumerate all frequent tree patterns without duplicates using reve...

Publication date: 2012